Abstract

In Automatic Text Summarization, preprocessing is an important phase to reduce the 
space of textual representation. Classically, stemming and lemmatization have been widely 
used for normalizing words. However, even using normalization on large texts, the curse 
of dimensionality can disturb the performance of summarizers. This paper describes a 
new method for normalization of words to further reduce the space of representation. 
We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The
results show that Ultra-stemming not only preserves the content of the summaries produced
with this representation, but often dramatically improves the performance of the systems.
Summaries on trilingual corpora were evaluated automatically with Fresa.
Results confirm an increase in performance, regardless of the summarizer system used.
1 Introduction

In Natural Language Processing (NLP), pre-processing aims to reduce the complexity of the
vocabulary of the documents. Pre-processing eliminates punctuation, filters out function
words and normalizes morphological variants. In particular, lemmatization and stemming
are two commonly used techniques to normalize morphological variants.

The lexeme or word-root is the part of a word that does not change and carries its meaning. The
morpheme or variable part is added to the lexeme to form new words. Morphological analysis
is a very important phase of pre-processing in NLP systems because it reduces the
dimension of the vector space representation in Information Retrieval systems [31, 32]. Several
applications such as Automatic Summarization, Document Indexing, Textual Classification and 
Question- Answering systems among othersjSl, utilize this reduction. However, the realization of 
this analysis may require the use of external resources (dictionaries, parsers, rules, etc.) which
can be expensive and difficult to build, depending on the language or specific domain [3]. Some
algorithms are able to detect morphological families statistically (posed as a classification
problem), avoiding the use of external resources or a priori knowledge of a language.

Automatic Text Summarization (ATS) is the process to automatically generate a compressed 
version of a source document [JT]. Query-oriented summaries focus on a user's request, and 
extract the information related to the specified topic given explicitly in the form of a query
[11]. Generic mono-document summarization tries to cover as much of the information
content as possible. Multi-document summarization is a task oriented to creating a summary from
a heterogeneous set of documents on a focused topic. Over the past years, extensive experiments
on query-oriented multi-document summarization have been carried out. Extractive
Summarization produces summaries by choosing a subset of representative sentences from the original
documents. Sentences are ordered, then assembled according to their relevance to generate the
final summary [31].

This article introduces a new method of word normalization that reduces the textual
representation space, in order to improve the efficiency of Automatic Text Summarizers based
on the Vector Space Model (VSM). We propose Ultra-stemming, which reduces every word to
its initial letter(s). Results show that Ultra-stemming not only preserves the content of the
summaries generated using this new representation but, surprisingly, can often improve
performance dramatically. To our knowledge, no automatic stemming method has explored this
extreme possibility in summarization tasks. Ultra-stemming could be an interesting alternative for
ATS of documents in languages where electronic linguistic resources are rare. In these
languages (such as Nahuatl and other American Indian languages), there is a notable absence of
lemmatizers, stemmers, parsers, dictionaries, corpora and language resources in general.
Our tests on trilingual corpora evaluated by the Fresa algorithm confirm the increase in
performance regardless of the summarizer used, and a large reduction of the space and time
complexity required to generate summaries. Related work is given in Section 2. Section 3 presents
our Ultra-stemming strategies coupled with methods of Automatic Text Summarization. Experiments are
presented in Section 5, followed by a discussion and the conclusions in Section 6.



2 Related works 

There are several morphological analysis methods |20lCT| . Examples of these algorithms are the 
Comparison of Graphs jTH] , the use of n-grams [T^ [55] , the search for analogies |27] , the surface 
models based on rules |25[ [37] , the probabilistic models [TO] , the segmentation by optimization 
[HI IIH]j the unsupervised learning of morphological families by ascending hierarchical classifi- 
cation [3], the lemmatization using Levenshtein distances [T2] or identifying suffixes through 
entropy |33]- These methods are distinguished by the type of results obtained, by the identifica- 
tion of lemmas, stems or suffixes. Flemm^ m] is an analyzer for French which requires a text 
previously labeled by WinBrill^ or by TreeTagger^. Flemm produces, among other results, 
the lemma of each word of the input text. Treetagger [25] is a multilingual tool that allows 
to annotate texts with information of Parts-Of-Speech (POS) and with information of lemma- 
tization. TreeTagger uses supervised machine learning and probabilistic methods [TJinH]- It 
can be adapted to other languages as long as the lexical resources and manually labeled corpora 
are available. FreeLing is another example of a popular multilingual lemmatizer.

Stemming transforms the variants of words into truncated forms. Two popular stemming
algorithms are the Porter stemming algorithm [37] and the Paice algorithm [35]. The methods
of stemming and lemmatization can be applied when the terms are morphologically similar.
Otherwise, when the similarity is semantic, lexical search methods must be used. To reduce
semantic variation, some systems use long dictionaries. Other systems use thesauri to associate
words with entirely different morphological forms [36]. Both methods are complementary, since
stemming verifies similarities at the spelling level to infer lexical proximity, while the lexical
algorithms use terminographic data with links to synonyms [24]. [17] presents an unsupervised
genetic algorithm for stemming inflectional languages. [46] proposes merging morphological
families into a single term to reduce the linguistic variety of Spanish indexed texts.

Flemm is available at: http://www.univ-nancy2.fr/pers/namer/Telecharger_Flemm.htm
WinBrill is available at: http://www.atilf.fr/scripts/mep.exe?HTML=mep_winbrill.txt;OUVRIR_MENU=1
TreeTagger is available at: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
The types of words are, for example, nouns, verbs, infinitives and particles.
FreeLing is available at: http://www.lsi.upc.edu/~nlp/freeling/

Lexematization pM) seeks morphological rearrangement of words belonging to the same fam- 
ily using automatic acquisition of morphological knowledge directly from the texts. Although 
the constitution of morphological families may be interesting in itself, its main interest lies in
the benefits it produces when used as a normalization mechanism (instead of or in addition to stemming
or lemmatization) in specific application domains. Probably the most common application domain
is term indexing in Information Retrieval (IR) systems. In recent years there have been
numerous articles analyzing the efficiency of stemming/lemmatization in IR for different languages.
In addition, significant progress has been made in IR in European languages other than
English. In particular, [23] evaluated corpora of the CLEF evaluation campaigns (eight
European languages). Their results show that morphological normalization techniques increase
the efficiency of IR systems and can be used independently of the language. Reduction
algorithms using syntactic and morphosyntactic variations have shown a significant reduction
of storage and management costs by storing lexemes rather than terms [45]. [1] studies
the impact of compound words and standardization in IR, finding no significant performance
differences between stemming and lemmatization.

However, the linguistic resources necessary to establish morphological
relationships without pre-defined rules are not available for all languages and all domains,
not to mention the constant creation of neologisms [8]. The solution proposed here for the specific
task of automatic summarization is the Ultra-stemming of words to their initial letters.

Research in ATS was introduced by H.P. Luhn in 1958 jSHl- In the strategy proposed by 
Luhn, the sentences are scored for their component word values as determined by tf*idf-like 
weights. Scored sentences are then ranked and selected from the top until some summary length 
threshold is reached. Finally, the summary is generated by assembling the selected sentences 
in original source order. Although fairly simple, this extractive methodology is still used in 
current approaches. Later on, |il3j extended this work by adding simple heuristic features of 
sentences, such as their position in the text or key phrases indicating the importance of
the sentences. As the range of possible features for source characterization widened, choosing
appropriate features, feature weights and feature combinations has become a central issue. A
natural way to tackle this problem is to consider sentence extraction as a classification task.
To this end, several machine learning approaches that use document-summary pairs have been
proposed [26, 40].

3 Pre-processing and Ultra-stemming 

The following subsections present formally the details of the corpora studied and the proposed 
text pre-processing method. 

3.1 Summarization Corpora Description 

To study the impact of Ultra-stemming in automatic summarization tasks, we used corpora in three
languages: English, Spanish and French. The corpora are heterogeneous, and cover different tasks
representative of Automatic Summarization: generic multi-document summarization and
mono-document summarization guided by a subject.

Cross-Language Evaluation Forum, http://www.clef-campaign.org/



• Corpus in English. Piloted by NIST in the Document Understanding Conference (DUC),
Task 2 of DUC'04 aims to produce a short summary of a cluster of related documents.
We studied generic multi-document summarization in English using data from DUC'04.
This corpus of 300K words is composed of 50 clusters of 10 documents each.

• Corpus in Spanish. We studied generic single-document summarization using a corpus from the
journal Medicina Clínica, which is composed of 50 medical articles in Spanish, each one
with its corresponding author abstract. This corpus contains 125K words.

• Corpus in French. We studied generic single-document summarization using the
Canadian French Sociological Articles corpus, generated from the journal Perspectives
interdisciplinaires sur le travail et la santé (Pistes). It contains 50 sociological articles
in French, each one with its corresponding author abstract. This corpus contains nearly
400K words.

Table 1 presents the basic statistics on tokens, types and characters of the three
summarization corpora studied.



Corpus             Language   Tokens     Types     Letters
DUC'04             English    294 236    17 780    1 834 167
Medicina Clínica   Spanish    125 024     9 657      793 937
Pistes             French     380 992    18 887    2 590 623

Table 1: Basic statistics for the three summarization corpora.



Additionally, three large and heterogeneous corpora (generated from novels, newspaper
articles and news on the Internet) were created to measure statistics of each language. These
corpora contain several million tokens in English, Spanish and French. Table 2 presents basic
statistics on tokens and characters of the three generic corpora.



Generic Corpus   Tokens        Letters
English          29 346 289    177 717 720
Spanish          21 445 694    134 461 092
French           17 734 663    111 169 782

Table 2: Basic statistics for the three generic language corpora.

3.2 Ultra-stemming 

The first step to represent documents in a suitable space is the pre-processing. As we use 
extractive summarization as task, documents have to be chunked into cohesive textual segments 
that will be assembled to produce the summary. Pre-processing is very important because the 
selection of segments is based on words or bigrams of words. The choice was made to split 
documents into full sentences, in this way obtaining textual segments that are likely to be 
grammatically correct. Afterwards, sentences pass through several basic normalization steps in 
order to reduce computational complexity. An example of document pre-processing is given in
Table 3. The process is composed of the following steps:

DUC: http://duc.nist.gov
DUC'04 guidelines: http://www-nlpir.nist.gov/projects/duc/guidelines/2004.html
Medicina Clínica: http://www.elsevier.es/revistas/ctl_servlet?_f=7032&revistaid=2
Pistes: http://www.pistes.uqam.ca/



1. Sentence splitting: a simple rule-based method is used for sentence splitting. Documents
are chunked at the period, exclamation and question marks.

2. Sentence filtering: words are converted to lowercase and stripped of stray punctuation.
Words with fewer than 2 occurrences (f < 2) are eliminated (a hapax legomenon
occurs only once in a document). Words that do not carry meaning, such as function words or
very common words, are removed. Small stop-lists (depending on the language) are used in
this step.

3. Word normalization: remaining words are replaced by their canonical form using
lemmatization, stemming, Ultra-stemming or none of them (raw text).

4. Text vectorization: documents are vectorized in a matrix S[P×N] of P rows (sentences) and N
columns, which holds the occurrences of the letter (Ultra-stemming) or word
(Lemmatization/Stemming/Raw) j, j = 1, 2, ..., N, in the sentence i, i = 1, 2, ..., P.

5. Summary generation: each summary is generated by a summarizer based on VSM.

For Ultra-stemming using n = 1 (Fix1), the maximum dimension N may be up to 32 letters.
This generates very compact and efficient matrices, as discussed in Section 3.4.
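Steps 1 to 4 above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in our experiments: the function name `ultra_stem_matrix`, the toy stop-list and the tokenization regular expression are hypothetical, and `min_freq` defaults to 1 so that the tiny example is not emptied by the hapax filter (the filtering described above corresponds to `min_freq=2`).

```python
import re
from collections import Counter

# Toy stop-list for illustration only; the actual stop-lists are not given in the text.
STOP_WORDS = {"a", "the", "of", "in", "is", "and", "to"}


def ultra_stem_matrix(text, n=1, min_freq=1):
    """Return (columns, matrix) for a text, following steps 1 to 4."""
    # Step 1: split at '.', '!' and '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Step 2: lowercase, keep alphabetic tokens only, drop stop words.
    tokens = [
        [w for w in re.findall(r"[^\W\d_]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    # Step 2 (continued): drop words occurring fewer than min_freq times.
    freq = Counter(w for sent in tokens for w in sent)
    tokens = [[w for w in sent if freq[w] >= min_freq] for sent in tokens]
    # Step 3: Ultra-stemming keeps the first n letters of each word.
    stems = [[w[:n] for w in sent] for sent in tokens]
    # Step 4: one row per sentence, one column per distinct stem.
    columns = sorted({t for sent in stems for t in sent})
    index = {t: j for j, t in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in stems]
    for i, sent in enumerate(stems):
        for t in sent:
            matrix[i][index[t]] += 1
    return columns, matrix


columns, matrix = ultra_stem_matrix("The cat sings. The singer sat in the car!")
print(columns)  # ['c', 's']
print(matrix)   # [[1, 1], [1, 2]]
```

With n = 1 the number of columns is bounded by the size of the alphabet, whatever the length of the document.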



A federal judge Monday found President Clinton in civil contempt of court for lying in a
deposition about the nature of his sexual relationship with former White House intern Monica
S. Lewinsky. Clinton, in a January 1998 deposition in the Paula Jones sexual harassment case,
swore that he did not have a sexual relationship with Lewinsky. Clinton later explained that
he did not believe he had lied in the case because the type of sex he had with Lewinsky did not
fall under the definition of sexual relations used in the case.

s0/A federal judge Monday found President Clinton in civil contempt of court for lying in a
deposition about the nature of his sexual relationship with former White House intern Monica
S. Lewinsky.
s1/Clinton, in a January 1998 deposition in the Paula Jones sexual harassment case, swore that
he did not have a sexual relationship with Lewinsky.
s2/Clinton later explained that he did not believe he had lied in the case because the type
of sex he had with Lewinsky did not fall under the definition of sexual relations used in the case.

s0/feder judg monday found presid clinton civil contempt court lying in deposit natur sexual
relationship former white hous intern monica lewinski
s1/clinton januari deposit paula jone sexual harass case swore sexual relationship lewinski
s2/clinton explain believ lie case type sex lewinski fall definit sexual relat case

s0/fjmfpccccldnsrfwhiml
s1/cjdpjshcssrl
s2/cleblctslfdsruc

letter: c d e f h i j l m n p r s u w
s0:     4 0 0 0 1 1 1 2 2 1 1 1 1 0 1
s1:     2 1 0 0 1 0 2 1 0 0 1 1 2 0 0
s2:     3 1 1 0 0 0 0 3 0 0 0 1 2 1 0

Table 3: Example of some pre-processing steps (Stemming, Ultra-stemming and matrix generation)
applied to the document NYT19990412.0403 from DUC 2006. The document is split into sentences;
punctuation and case are removed; words are normalized.

For comparison, four methods of normalization were applied after filtering: 



• Lemmatization by a simple dictionary of morphological families: 1.32M word entries in
Spanish, 208K in English and 316K in French.

• Porter's Stemming, available at the Snowball site
(http://snowball.tartarus.org/texts/stemmersoverview.html) for English, Spanish and French,
among other languages.

• Raw text without normalization.

• Ultra-stemming: the first n letters of each word. For example, in the case of Ultra-stemming
with n = 1 (Fix1), inflected verbs "sing", "song", "sings", "singing"... or proper
names "smith", "snowboard", "sex",... are all replaced by the letter "s".
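The effect of the parameter n can be illustrated with a short Python sketch that groups words by their ultra-stem. The helper names `ultra_stem` and `group_by_ultra_stem` are hypothetical and only serve this example.

```python
from collections import defaultdict


def ultra_stem(word, n=1):
    """Keep only the first n letters of a lowercased word."""
    return word.lower()[:n]


def group_by_ultra_stem(words, n=1):
    """Map each ultra-stem to the list of words that collapse onto it."""
    groups = defaultdict(list)
    for word in words:
        groups[ultra_stem(word, n)].append(word)
    return dict(groups)


words = ["sing", "song", "sings", "singing", "smith", "snowboard", "sex"]
print(group_by_ultra_stem(words, n=1))
# {'s': ['sing', 'song', 'sings', 'singing', 'smith', 'snowboard', 'sex']}
print(group_by_ultra_stem(words, n=2))
# {'si': ['sing', 'sings', 'singing'], 'so': ['song'], 'sm': ['smith'],
#  'sn': ['snowboard'], 'se': ['sex']}
```

With n = 1 all seven words fall into a single class; with n = 2 the inflected forms of "sing" stay together while unrelated words are separated.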

3.3 Why could Ultra-stemming work?

Although this technique could be considered a brutal destruction of the lexicon, Ultra-stemming
is, in fact, an extreme form of stemming. That is, this truncation represents, with minimum
information, what we call the stem of the stem. In the case of Ultra-stemming with n = 1, the
construction of the sentence vectors is performed in a space of j = 1, 2, ..., 32 classes, which
produces a dense vector representation.

Of course, classes are not equally populated. Figures 1 to 3 show the ranking of initial letters in
three corpora in English, Spanish and French. Numbers and function words were previously
removed.

In an automatic extractive summarizer, the weight of phrases is computed in a suitable
vector space. However, if the space is too large, the resulting representation is very
sparse, which can hinder the weighting of the sentences. Two hypotheses underlie
the use of Ultra-stemming in the automatic summarization task.



[Figure 1 plot. Title: Corpus of 29.34M tokens, 16.72M types, English. Horizontal axis: Letter, in rank order s c p a b t d r m f e h l i w n g o u v j k y q z x.]

Figure 1: Scatter plot of first letter ranking for the English corpus. There are 16.72M types
after filtering of function words and punctuation.

[Figure 2 plot. Title: Corpus of 21.44M tokens, 4.2M types, Spanish.]

Figure 2: Scatter plot of first letter ranking for the Spanish corpus. There are 4.53M types
after filtering of function words and punctuation.

Figure 3: Scatter plot of first letter ranking for the French corpus. There are 8.53M types
after filtering of function words and punctuation.

The first hypothesis is that a more condensed representation, which retains the important
information of the original one, would enable a more effective weighting for sentence extraction.
Ultra-stemming produces an extremely compact representation of documents, in a vector space that
can be reduced to only about thirty letters, using the representation of one letter per word. One way of
evaluating the effectiveness of a vector representation is to calculate the density of the
resulting matrix. This point will be discussed in detail in the next section. The other way is
to show that two matrices A and B are equivalent in the sense that they contain
similar information. If A is smaller than B, and A and B represent approximately the same
information, then it may be preferable to use the representation given by A instead of B.

Now, how does one know that two matrices contain about the same information? The
second hypothesis is that if the matrices A and B are correlated, then they probably represent
similar information. This point will be tested in Section 4 with the Mantel statistical test.



3.4 Matrix density 

Pre-processing and vectorization of documents produce very sparse matrices. However, the
density of the matrices generated depends directly on the pre-processing algorithm used.
Intuitively, the density of matrices generated by Ultra-stemming must be much greater than that of
those generated by classical normalizations. We have calculated the density δ of a matrix S[P×N], of
P sentences and a vocabulary of N words, as the number of occurrences C_w of the words w (elements
other than 0), divided by the size of the matrix ρ = P × N. Equation (1) gives the
density of S.

(1)  \delta(S) = \frac{\sum_{w} C_w}{\rho}, \qquad \rho = P \times N


This density can be an indicator of the amount of information in relation to the volume of the
matrix: lower density implies a greater amount of computation for ranking sentences. As shown
in Table 4, Ultra-stemming produces the highest average density
on the studied corpora. The matrices generated by Ultra-stemming are approximately 50% filled
(56% for English, 64% for Spanish and 47% for French). The volume of the matrix generated
by each pre-processing method, relative to the size of the matrix for plain text, is given by:

(2)  V = \frac{\rho}{\rho(\mathrm{Raw})}



For Ultra-stemming, this volume represents a small fraction (between 5% and 13%, depending on
the language) of the equivalent plain-text matrix.

In the case of the Medicina Clínica corpus, the standard matrices (lemm ≈ 101%, stem ≈ 103%)
are slightly larger than the matrix produced by the plain text (raw). This can be explained by
the presence of hapax legomena. In the case of plain text, a large number of hapax (f = 1)
are eliminated, which can produce slightly smaller matrices.
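The two quantities of this section can be computed directly from a sentence-by-term matrix. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the density counts the non-zero cells of the matrix, which is our reading of "elements other than 0".

```python
def matrix_size(matrix):
    """Size rho = P x N of a matrix stored as a list of P rows of N values."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    return rows * cols


def density(matrix):
    """Fraction of non-zero cells in the matrix."""
    size = matrix_size(matrix)
    if size == 0:
        return 0.0
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / size


def volume(matrix, raw_matrix):
    """Size of a matrix relative to the size of the raw-text matrix."""
    return matrix_size(matrix) / matrix_size(raw_matrix)


# Toy matrices: 2 sentences, 2 letters (Ultra-stemming) versus 8 words (raw).
ultra = [[1, 0], [2, 3]]
raw = [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 1, 0, 0, 1]]
print(density(ultra))      # 0.75
print(density(raw))        # 0.3125
print(volume(ultra, raw))  # 0.25
```

The toy example reproduces the qualitative pattern of Table 4: the Ultra-stemming matrix is much denser and much smaller than the raw one.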



DUC'04, ⟨P⟩ = 238.0
Pre-processing   Density δ     Size ⟨N⟩      ρ = ⟨P⟩ × ⟨N⟩   Volume V (Raw = 100%)
Lemmatization     2.6%         405.5          96 509.0         96.0%
Stemming         (illegible)  (illegible)   (illegible)        99.0%
Raw              (illegible)  (illegible)   (illegible)       100.0%
Fix1             55.6%          25.6           6 092.8          6.0%

Medicina Clínica, ⟨P⟩ = 88.6
Pre-processing   Density δ     Size ⟨N⟩      ρ = ⟨P⟩ × ⟨N⟩   Volume V (Raw = 100%)
Lemmatization     5.9%         177.0          15 682.2        101.3%
Stemming         (illegible)  (illegible)   (illegible)      (illegible)
Raw               5.1%         174.7          15 478.4        100.0%
Fix1             63.7%          22.2           1 966.9         12.7%

Pistes, ⟨P⟩ = 319.7
Pre-processing   Density δ     Size ⟨N⟩      ρ = ⟨P⟩ × ⟨N⟩   Volume V (Raw = 100%)
Lemmatization     2.0%         457.7         146 326.7         90.0%
Stemming          1.9%         474.5         151 697.7         93.0%
Raw               1.6%         508.5         162 567.5        100.0%
Fix1             46.8%          25.0           7 992.5          4.9%

Table 4: Matrix density for the three corpora. The mean dimension of matrix S is ρ = ⟨P⟩ × ⟨N⟩.
Density δ(S) is calculated by equation (1) and Volume by equation (2).



Statistics for the DUC'04 (English), Medicina Clínica (Spanish) and Pistes (French) summarization
corpora, after removing stop-words, hapax legomena and punctuation, are shown in Table 5.
The mode of letters per word is 5, 6-7 and 6, respectively, for each language.



Corpus                                        Words      Letters      Mean of letters   Mode on generic
                                                                      per word          corpus
DUC'04 (11 956 sentences, English)
  Lemmatization                               137 454      800 723    5.83              -
  Stemming                                    137 101      764 015    5.57              -
  Raw                                         136 582      902 914    6.61              5
  Fix1                                        137 461      137 461    1.00              -
Medicina Clínica (4 480 sentences, Spanish)
  Lemmatization                                56 063      484 281    8.64              -
  Stemming                                     56 067      410 048    7.31              -
  Raw                                          56 115      526 660    9.38              6-7
  Fix1                                         56 347       56 347    1.00              -
Pistes (16 037 sentences, French)
  Lemmatization                               167 056    1 505 169    9.01              -
  Stemming                                    167 231    1 264 774    7.56              -
  Raw                                         167 677    1 589 190    9.48              6
  Fix1                                        168 329      168 329    1.00              -

Table 5: Statistics for the three summarization corpora after filtering and removing punctuation.



Figures 4, 5 and 6 show the average distribution of letters per word for the three summarization
corpora, after the filtering described in Section 3.2. Curves are shown normalized to [0, 1] for the
large generic corpora representative of each language (cf. Section 3.1) and for the corpora used
in each of the summarization experiments.



[Figure 4 plot. Title: Corpus English. Horizontal axis: Size of word (in letters), from 1 to 20.]

Figure 4: Scatter plot of mean length of words for two English corpora (heterogeneous and
summarization raw corpora after filtering).

[Figure 5 plot. Title: Corpus Spanish. Horizontal axis: Size of word (in letters), from 1 to 20.]

Figure 5: Scatter plot of mean length of words for two Spanish corpora (heterogeneous and
summarization raw corpora after filtering).

[Figure 6 plot. Title: Corpus French. Horizontal axis: Size of word (in letters), from 1 to 20.]

Figure 6: Scatter plot of mean length of words for two French corpora (heterogeneous and
summarization raw corpora after filtering).



4 Matrix test correlation: the test of Mantel 

Different methods of data analysis, such as ranking, are based on distance matrices. [6] indicates:
"In some cases, researchers may wish to compare several distance matrices with one another in 
order to test a hypothesis concerning a possible relationship between these matrices. However, 
this is not always evident. Usually, values in distance matrices are, in some way, correlated 
and therefore the usual assumption of independence between objects is violated in the classical 
tests approach. Furthermore, often, spurious correlations can be observed when comparing two 
distances matrices." 

As [6] shows, in the Mantel test [33], the null hypothesis is that distances in a matrix A are
independent of the distances, for the same objects, in another matrix B. In other words, we are 
testing the hypothesis that the process that has generated the data is or is not the same in the 
two sets. Then, testing of the null hypothesis is done by a randomization procedure in which the 
original value of the statistic is compared with the distribution found by randomly reallocating 
the order of the elements in one of the matrices. The measure used for the correlation between 
A and B is the Pearson correlation coefficient: 

(3)  r(A, B) = \frac{1}{P - 1} \sum_{i=1} \sum_{j=1} \frac{A_{ij} - \langle A \rangle}{\sigma_A} \cdot \frac{B_{ij} - \langle B \rangle}{\sigma_B}

where P is the number of elements in the lower (upper) triangular part of the matrix, ⟨A⟩ is the
mean of the elements of A and σ_A is the standard deviation of the elements of A.

The coefficient r measures the linear correlation and hence is subject to the same statistical
assumptions. Consequently, if non-linear relationships between matrices exist, they will be
degraded or lost (r < 0). The testing procedure for the simple Mantel test is the same as in
[6], and is as follows. Assume two symmetric dissimilarity matrices A and B of size [P × P].
The rows and columns correspond to the same objects.

1. Compute the Pearson correlation coefficient r(A, B) between the corresponding elements
of the lower-triangular parts of A and B, using equation (3).

2. Randomly permute the rows and the corresponding columns of the matrix A, creating a new
matrix A'.

3. Compute r(A', B) between matrices A' and B.

4. Repeat steps 2 and 3 a large number of times. This will constitute the reference distribution
under the null hypothesis.
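The four steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the zt program used in our experiments. The function names are hypothetical, and the one-sided p-value estimate (hits + 1) / (permutations + 1) is a common convention that we assume here; it is not specified in the text.

```python
import math
import random


def lower_triangle(m):
    """Elements strictly below the diagonal of a square matrix."""
    return [m[i][j] for i in range(len(m)) for j in range(i)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_test(a, b, permutations=999, seed=0):
    """Return (r, p) for two symmetric matrices over the same objects."""
    rng = random.Random(seed)
    n = len(a)
    b_values = lower_triangle(b)
    r_observed = pearson(lower_triangle(a), b_values)  # step 1
    hits = 0
    for _ in range(permutations):  # step 4
        order = list(range(n))
        rng.shuffle(order)
        # Step 2: permute rows and the corresponding columns of A.
        a_perm = [[a[order[i]][order[j]] for j in range(n)] for i in range(n)]
        # Step 3: correlation between the permuted A and B.
        if pearson(lower_triangle(a_perm), b_values) >= r_observed - 1e-12:
            hits += 1
    return r_observed, (hits + 1) / (permutations + 1)


a = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
b = [[0 if i == j else 2 * a[i][j] + 1 for j in range(4)] for i in range(4)]
r, p = mantel_test(a, b, permutations=99)
print(round(r, 6))  # 1.0, since B is a linear function of A off the diagonal
```

With only four objects there are few distinct permutations, so the p-value of this toy example is not meaningful; realistic matrices have one row per sentence.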

The calculation of the correlation between the matrix generated by Ultra-stemming and
those of the other normalization methods is not straightforward, because the matrices are not
square. In general, the matrix produced by Ultra-stemming has a smaller number of columns than the
other ones. Thus, to calculate a correlation between matrices with different numbers of columns,
each matrix must be converted into a symmetric matrix.

Let S'[P×N'], of P rows and N' columns, be a matrix produced by Ultra-stemming, and let
S[P×N], of P rows and N columns, be a matrix produced by a classic method of normalization
such as stemming, lemmatization, etc. We have the condition that N' < N. Let the new
matrices be A[P×P] = S' × S'^T and B[P×P] = S × S^T. They are square and symmetric. A standard
Mantel test can indicate the degree of similarity between A and B. If the similarity is high
(r > 0) with a high confidence value (p → 0), this means that the information in the matrix A is
substantially the same as that contained in the matrix B. In other words, we could use S'
in place of S for purposes of a vector representation of documents.
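The conversion of a rectangular sentence-by-term matrix into a square symmetric one is a product of the matrix with its own transpose. The Python sketch below (the function name `gram_matrix` and the toy matrices are hypothetical) shows that two matrices with different numbers of columns yield comparable P × P matrices.

```python
def gram_matrix(s):
    """Return S x S^T for a matrix S stored as a list of rows."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in s] for row_i in s]


s_ultra = [[1, 1], [1, 2], [0, 3]]          # P = 3 sentences, N' = 2 letters
s_raw = [[1, 1, 0, 0], [1, 0, 1, 1], [0, 0, 2, 1]]  # P = 3 sentences, N = 4 words
a = gram_matrix(s_ultra)
b = gram_matrix(s_raw)
print(a)  # [[2, 3, 3], [3, 5, 6], [3, 6, 9]]
print(b)  # [[2, 1, 0], [1, 3, 3], [0, 3, 5]]
```

Both results have one row and one column per sentence, so they can be compared with the Mantel test even though the original matrices have different numbers of columns.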

Tables 6, 7 and 8 show the correlation of the Mantel test for the three summarization corpora
studied. The correlation was calculated between the matrices S generated by lemmatization
(Lemm), stemming (Stem), plain text (Raw) and the matrix S' generated by Ultra-stemming
Fix1 using the initial letter. In all cases the correlation is positive with p-value < 0.001, which is
significant. The calculations were performed with the zt program, written in C by Eric Bonnet
and Yves Van de Peer [5].



DUC'04   Lemm      Stem      Raw       Fix1
Lemm     -         0.96149   0.91287   0.51904
Stem     0.96149   -         0.94492   0.53800
Raw      0.91287   0.94492   -         0.46914
Fix1     0.51904   0.53800   0.46914   -

Table 6: Mantel test correlation for corpus DUC'04 data (English, p-value=0.001).



zt: a software tool for simple and partial Mantel tests. This program can be downloaded from the
Web site http://bioinformatics.psb.ugent.be/software/details/ZT



Medicina Clínica   Lemm      Stem      Raw       Fix1
Lemm               -         0.96725   0.91174   0.58541
Stem               0.96725   -         0.91942   0.49614
Raw                0.91174   0.91942   -         0.51503
Fix1               0.58541   0.49614   0.51503   -

Table 7: Mantel test correlation for corpus Medicina Clínica data (Spanish, p-value=0.001).



Pistes   Lemm      Stem      Raw       Fix1
Lemm     -         0.93016   0.85708   0.53801
Stem     0.93016   -         0.89499   0.51641
Raw      0.85708   0.89499   -         0.45156
Fix1     0.53801   0.51641   0.45156   -

Table 8: Mantel test correlation for corpus Pistes data (French, p-value=0.001).

According to these correlations, in the DUC'04 English corpus, the Fix1 method is most
correlated with Stemming normalization. In the Spanish and French corpora, Fix1 seems slightly
more correlated with lemmatization. This is intuitively correct and consistent with the
reduced variability of English in relation to Spanish or French.

5 Experiments 

The Ultra-stemming method described above has been implemented and evaluated
on several corpora in English, Spanish and French. The following subsections present
details of the different experiments.

5.1 Summarizers 

Three summarization systems were used in our experiments: Cortex, Enertex and Artex.
All systems used the same text representation based on the Vector Space Model, described in
Section 3.

• Cortex is a single-document summarization system using several metrics and an optimal 
decision algorithm |43l ST] . 

• Enertex is a summarization system based on the Textual Energy concept [13]: text is
represented as a spin system where spins ↑ represent words whose number of occurrences is f > 0
(spins ↓ if the word is not present).

• Artex (AnotheR TEXt summarizer) is a single-document summarization system that
computes the score of a sentence by calculating a dot product between a sentence vector
and a frequencies vector, multiplied by the lexicon used.

We conducted our experiments with the following languages, summarization tasks,
summarizers and data sets: 1) generic multi-document summarization in English with the
corpus DUC'04; 2) generic single-document summarization in Spanish with the corpus Medicina
Clínica; and 3) generic single-document summarization in French with the corpus Pistes.

We then applied the summarization algorithms after the pre-processing algorithm,
and finally evaluated the results using Fresa.



5.2 Summaries Evaluation 



Evaluating the quality of a summary is not an easy task, and remains an open question. The DUC
conferences introduced the ROUGE evaluation [55], which measures the overlap of n-grams
between a candidate summary and reference summaries written by humans. On the other hand,
several metrics without references have been defined and tested at the DUC and TAC
workshops. The Fresa measure [32] is similar to the ROUGE evaluation but does not use
reference summaries. It calculates the divergence of probabilities between the candidate
summary and the source document. Among these metrics, Kullback-Leibler (KL) and
Jensen-Shannon (JS) divergences have been used [29, 42] to evaluate the informativeness of
summaries. In this paper, we use Fresa, based on KL divergence with Dirichlet smoothing, as
in the 2010 and 2011 INEX editions [33], to evaluate the informative content of summaries by
comparing their n-gram distributions with those of the source documents.

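Fresa only considers the absolute log-difference between frequencies. Let $T$ be the set of
terms in the source. For every $t \in T$, we denote by $C_t^T$ its occurrences in the source
and by $C_t^S$ its occurrences in the summary. The Fresa package computes the divergence
between the source and the summaries as:

$$D(T \| S) = \sum_{t \in T} \left| \log\left(\frac{C_t^T}{|T|} + 1\right) - \log\left(\frac{C_t^S}{|S|} + 1\right) \right|$$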
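An absolute log-difference between source and summary frequencies can be sketched as follows. This is a sketch under stated assumptions, not the Fresa package: the function name `log_diff_divergence` is hypothetical, counts are assumed to be normalized by the number of terms in each text, and 1 is assumed to be added inside each logarithm so that terms missing from the summary remain defined.

```python
import math
from collections import Counter


def log_diff_divergence(source_terms, summary_terms):
    """Sum over source terms of |log(p + 1) - log(q + 1)|.

    p and q are the relative frequencies of the term in the source and in
    the summary (assumed normalization; a sketch, not the Fresa package).
    """
    src = Counter(source_terms)
    summ = Counter(summary_terms)
    n_src = len(source_terms)
    n_sum = len(summary_terms)
    total = 0.0
    for term, count in src.items():
        p = count / n_src
        # A term missing from the summary (or an empty summary) gives q = 0.
        q = summ[term] / n_sum if n_sum else 0.0
        total += abs(math.log(p + 1.0) - math.log(q + 1.0))
    return total


source = ["a", "a", "b"]
print(log_diff_divergence(source, source))   # identical distributions
print(log_diff_divergence(source, ["a"]))    # summary misses the term "b"
```

A summary whose term distribution is identical to that of the source yields a divergence of zero, and every source term missing from the summary adds a positive contribution.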



To evaluate the quality of generated summaries, several automatic measures were computed: 

• Fresa1: Unigrams of single stems after removing stop-words.

• Fresa2: Bigrams of pairs of consecutive stems (in the same sentence).

• FresaSU4: Bigrams with 2-gaps, also made of pairs of consecutive stems but allowing the
insertion between them of a maximum of two stems.

• ⟨Fresa⟩ = (Fresa1 + Fresa2 + FresaSU4) / 3: the mean of the Fresa values, which represents
the final score in our experiments.

The Fresa scores are normalized between 0 and 1. High values mean less divergence
with respect to the source document, reflecting a greater amount of information content.
All summaries produced by the systems were evaluated automatically using the Fresa package.
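The three kinds of units described above (single stems, consecutive pairs, and pairs separated by at most two stems) and the final mean can be illustrated with a short sketch. The helper names `bigrams`, `skip_bigrams` and `mean_fresa` are hypothetical and are not part of the Fresa package; the input is assumed to be the list of stems of one sentence after stop-word removal.

```python
def bigrams(stems):
    """Pairs of consecutive stems within one sentence."""
    return [(stems[i], stems[i + 1]) for i in range(len(stems) - 1)]


def skip_bigrams(stems, max_gap=2):
    """Ordered pairs of stems separated by at most `max_gap` other stems."""
    pairs = []
    for i in range(len(stems)):
        last = min(i + 2 + max_gap, len(stems))
        for j in range(i + 1, last):
            pairs.append((stems[i], stems[j]))
    return pairs


def mean_fresa(fresa_1, fresa_2, fresa_su4):
    """Mean of the three scores, used as the final score."""
    return (fresa_1 + fresa_2 + fresa_su4) / 3.0


sentence = ["a", "b", "c", "d", "e"]
print(bigrams(sentence))
print(skip_bigrams(sentence))
print(mean_fresa(0.3, 0.2, 0.1))
```

In this example the pair ("a", "d") is kept because only two stems separate its members, whereas ("a", "e") is excluded because three stems would have to be skipped.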

5.3 Results 

Below we present separate results for the three languages. In this way, we have analyzed 
linguistic phenomena specific to each language. 

5.3.1 English corpus 

Results in Figure 7 show that Ultra-stemming improves the score of the three automatic
summarization systems. This result is remarkable for Fix1, whose matrix represents on
average only 6% of the volume of the plain-text matrix.

Figure 7: Histogram plot of content evaluation for corpus DUC 2004 Task 2 with ⟨Fresa⟩
measures, for each summarizer and each normalization.

Figure 8: Scatter plot of ⟨Fresa⟩ mean of Ultra-stemming using n first letters (corpus DUC
2004 Task 2, Cortex summarizer).

As shown in Figure 7, the performance of the three summarizers improves with
Ultra-stemming compared with the other normalizations. In particular, relative to
lemmatization (the better of the two classic normalizations), the summarizer Artex goes from
0.02451 to 0.02697 using the Fix1 normalization, i.e. an increase of 10%. Cortex increases
from 0.02435 to 0.02682, a gain of 10.1%, and Enertex increases from 0.02141 to 0.02626,
a gain of 22.7%.
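The percentages quoted above are relative increases with respect to the lemmatization score. They can be checked with the short script below, which uses only the scores reported in the text; the function name `relative_increase` is an illustrative choice.

```python
def relative_increase(before, after):
    """Relative increase from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# (lemmatization score, Fix1 score) on the DUC'04 corpus, as reported in the text.
duc04_scores = {
    "Artex": (0.02451, 0.02697),
    "Cortex": (0.02435, 0.02682),
    "Enertex": (0.02141, 0.02626),
}

for name, (lemm, fix1) in duc04_scores.items():
    print(f"{name}: {relative_increase(lemm, fix1):.1f}%")
# Artex: 10.0%
# Cortex: 10.1%
# Enertex: 22.7%
```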

A detailed analysis for a particular summarizer is shown in Figure 8. This figure shows
the average Fresa score obtained on the DUC'04 English corpus as a function of the
Ultra-stemming used, with n = 1, 2, ..., 14 letters, for the automatic summarizer Cortex.
For comparison, the Fresa values for lemmatization (lemm), stemming (stem) and plain text
(raw) are also shown in the graph.



5.3.2 Spanish corpus 

Spanish is a language with greater variability than English. Results in Figure 9 show that
Ultra-stemming improves the score of the three automatic summarization systems used. For
the summarizers Cortex and Artex, stemming and lemmatization obtain substantially
the same scores, which is not the case for Enertex. However, comparing Ultra-stemming Fix1
against stemming, all three summarizers benefit from an increased score (Artex 5%,
Enertex 5.25% and Cortex 7.11%).

Figure 9: Histogram plot of content evaluation for Spanish corpus Medicina Clinica with
⟨Fresa⟩ scores for each summarizer.



Figure 10 shows the mean score ⟨Fresa⟩ on the Spanish corpus Medicina Clinica as a function
of the Ultra-stemming used (n = 1, 2, ..., 14 letters), using the automatic summarizer Cortex.
The Fresa values for lemmatization (Lemm), stemming (Stem) and plain text (Raw) are also shown.



[Plot: Medicina Clinica, Cortex algorithm; x-axis: Ultra-stemming using n first letters.]

Figure 10: Scatter plot of ⟨Fresa⟩ mean vs. Ultra-stemming using n first letters (corpus
Medicina Clinica, Cortex summarizer).



5.3.3 French corpus 



[Plot: Corpus Pistes (French, 50 docs, generic summarization); x-axis: summarizer (Artex,
Enertex, Cortex); legend: fix1, fix2, fix3, fix4, lemm, stem, raw.]



Figure 11: Histogram plot of content evaluation for French corpus Pistes with ⟨Fresa⟩ scores
for each summarizer.

Figure 12: Scatter plot of ⟨Fresa⟩ mean vs. Ultra-stemming using n first letters (corpus
Pistes, Cortex summarizer).



Results in Figure 11 show that Ultra-stemming improves the score of the three automatic
summarization systems used. In particular, the summarizer Enertex obtains a Fresa score of
0.25928 using a stemming representation and a score of 0.29311 using the Fix1
representation, i.e. an increase of more than 13%.

Finally, Figure 12 shows the detailed mean score ⟨Fresa⟩ on the French corpus Pistes as a
function of n = 1, 2, ..., 14 letters, using the automatic summarizer Cortex. It also shows
the Fresa values for lemmatization (Lemm), stemming (Stem) and plain text (Raw).

Overall, for the three languages, beyond a certain number of letters (5 for English, 7 for
Spanish and 6 for French) Ultra-stemming loses its effectiveness and the lemmatization score
is higher. Table 5 shows that this limit is related to the mean, rather than the mode, of
letters per word in each language. Apparently, Ultra-stemming is worthwhile when the number
of characters used is less than the mode of the language in question.



6 Discussion and conclusion 

In this paper we have introduced and tested a simple pre-processing method suitable for
automatic text summarization. Ultra-stemming is fast and simple. It reduces the size of the
matrix representation while retaining the information and characteristics of the document.
An important aspect of our approach is that it does not require linguistic knowledge or
resources, which makes it a simple and efficient pre-processing method for tackling the
problem of Automatic Text Summarization.

And what about processing times?

In general, the processing times of Ultra-stemming Fix1 are shorter than those of all other
methods. Of course, processing time depends on the summarization algorithm and the
pre-processing algorithm. In general, the processing time T is the sum:

T = time(filtering) + time(normalization) + time(summarizer)

In our experiments, time(filtering) is independent of the summarizer and, in general, the
filtering algorithm is very fast. The term time(normalization) depends on the algorithm used
(stemming, lemmatization) and/or on an external resource (lemmatization dictionary). The term
time(summarizer) is intrinsic to each summarization system.

For example, Cortex is a very fast summarizer with complexity $O(\log p^2)$ (where
$p \approx P \times N$), and its processing times for Stemming, Raw and Fix1 are close. On
the other hand, the Enertex summarizer has a complexity of $O(p^2)$, so it needs more time to
process the same corpus. In this case, Ultra-stemming is a very interesting alternative for
summarizing long corpora. Table 9 shows the processing times for each corpus and each
normalization method, for the Cortex, Artex and Enertex summarizers. All times were measured
on a computer with 7.8 GB of RAM and a Core i7-2640M CPU @ 2.80GHz × 4 processor, running
32-bit GNU/Linux (Ubuntu version 12.04).



Summarizer   Normalization    DUC'04    Medicina Clinica    Pistes    Mean (All)

Cortex       Lemmatization    0.80'     2.88'               1.13'     1.60'
             Stemming         0.40'     0.26'               0.53'     0.54'
             Raw              0.33'     0.26'               0.41'     0.40'
             Fix1             0.31'     0.26'               0.38'     0.32'

Artex        Lemmatization    1.71'     3.10'               2.70'     2.50'
             Stemming         1.35'     0.40'               2.11'     1.29'
             Raw              1.30'     0.38'               2.13'     1.27'
             Fix1             0.41'     0.28'               0.51'     0.40'

Enertex      Lemmatization    9.25'     3.38'               18.63'    10.42'
             Stemming         9.28'     0.75'               18.38'    9.47'
             Raw              9.16'     0.73'               20.76'    10.22'
             Fix1             3.93'     0.46'               8.35'     4.25'




Table 9: Statistics of processing times (in minutes) of three summarizers over three corpora. 

Clearly, lemmatization with a large dictionary is the most time-consuming strategy. This
is notable for the Spanish corpus, which uses a dictionary of 1.3M entries. At the same time,
lemmatization is the strategy that produces the best results after Ultra-stemming (Fix_n
with n = 1...4 letters). In the case of the Artex summarizer, the gain in time is dramatic:
the mean time goes from 2.50' using lemmatization to 0.40' using Fix1, i.e. lemmatization
takes 625% of the Fix1 time. This ratio is 500% for Cortex and 245% for Enertex.
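The time gains quoted above correspond to the ratio between the mean lemmatization time and the mean Fix1 time of Table 9. The check below uses only the "Mean (All)" values of that table; the function name `speedup` is an illustrative choice.

```python
def speedup(t_lemmatization, t_fix1):
    """Ratio of the mean lemmatization time to the mean Fix1 time."""
    return t_lemmatization / t_fix1


# Mean (All) times in minutes from Table 9: (lemmatization, Fix1).
mean_times = {
    "Artex": (2.50, 0.40),
    "Cortex": (1.60, 0.32),
    "Enertex": (10.42, 4.25),
}

for name, (t_lemm, t_fix1) in mean_times.items():
    ratio = speedup(t_lemm, t_fix1)
    print(f"{name}: {ratio:.2f}x ({100 * ratio:.0f}%)")
# Artex: 6.25x (625%)
# Cortex: 5.00x (500%)
# Enertex: 2.45x (245%)
```

Expressed as a reduction of processing time instead, these ratios correspond to savings of 84%, 80% and about 59% respectively.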

From our point of view, Ultra-stemming of n letters has three important advantages:

1. A reduction of the space and the computation time of automatic summarization algorithms
based on the vector space model.

2. An improvement of summary content, when using n < mode of letters per word of each
language.

3. Applications to resource-sparse languages. Typically, for π-languages, where no
lemmatizers, stemmers, parsers, corpora or native linguists are available, Ultra-stemming
can be an attractive alternative for automatic document summarizers.



Summarization using the Ultra-stemming representation for sentence scoring improves the
identification of the most relevant sentences in documents. The results obtained on corpora
in English, Spanish and French show that Ultra-stemming can achieve good results for content
quality. Tests with other corpora (DUC evaluation campaigns, TAC, INEX, etc.) in mono- and
multi-document summarization guided by a subject, and with π-languages (Nahuatl, Maya,
Somali, Interlingua, etc.), using content evaluation with or without reference summaries,
are still in progress.